(1959)
Field of study that gives computers the ability to learn without being explicitly programmed
(1997)
A computer program is said to learn from experience E with respect to some task T and some performance measure P if its performance on T, as measured by P, improves with E
Source for further reading
Supervised learning algorithms make predictions based on a set of examples. For instance, historical sales can be used to estimate future prices. With supervised learning, you have input variables consisting of labeled training data and a desired output variable. You use an algorithm to analyze the training data and learn the function that maps the input to the output. This inferred function generalizes from the training data to map new, unknown examples and anticipate results in unseen situations.
Classification
When the data are used to predict a categorical variable, supervised learning is also called classification. This is the case when assigning a label or indicator, such as dog or cat, to an image. When there are only two labels, this is called binary classification. When there are more than two categories, the problem is called multi-class classification.
Regression
When predicting continuous values, the problem becomes a regression problem.
Forecasting
This is the process of making predictions about the future based on past and present data. It is most commonly used to analyze trends. A common example is estimating next year's sales based on the sales of the current and previous years.
The challenge with supervised learning is that labeling data can be expensive and time consuming. If labels are limited, you can use unlabeled examples to enhance supervised learning. Because the machine is not fully supervised in this case, we say the machine is semi-supervised. With semi-supervised learning, you use unlabeled examples with a small amount of labeled data to improve the learning accuracy.
When performing unsupervised learning, the machine is presented with totally unlabeled data. It is asked to discover the intrinsic patterns that underlie the data, such as a clustering structure, a low-dimensional manifold, or a sparse tree or graph.
Clustering
Grouping a set of data examples so that examples in one group (or one cluster) are more similar to each other (according to some criterion) than to those in other groups. This is often used to segment the whole dataset into several groups. Analysis can then be performed within each group to help users find intrinsic patterns.
Dimension reduction
Reducing the number of variables under consideration. In many applications, the raw data have very high dimensional features and some features are redundant or irrelevant to the task. Reducing the dimensionality helps to find the true, latent relationship.
Reinforcement learning analyzes and optimizes the behavior of an agent based on the feedback from the environment. Machines try different scenarios to discover which actions yield the greatest reward, rather than being told which actions to take. Trial and error and delayed reward distinguish reinforcement learning from other techniques.
Learning Theory
Study of how and why (mathematically) a learning algorithm works
Anything in the world that we believe has an impact on the value we are trying to predict (and that can be measured).
A sample is a collection of values given to the features for a single result (predicted or known).
Say we are trying to predict how many hairs you have on your head based on Age, Weight, Gender, Neighborhood and Soccer Team.
Each person for whom we measure the Age, Weight, etc. is a sample.
Each sample is a vector and we write it as $X=[x_1, x_2, ..., x_n]$
To list the samples we use an upper index and write them as a column vector:
\begin{align} \begin{bmatrix} X^0 \\ X^1 \\ \vdots \\ X^m \end{bmatrix} &= \begin{bmatrix} [x_1^0,x_2^0,...,x_n^0] \\ [x_1^1,x_2^1,...,x_n^1] \\ \vdots \\ [x_1^m,x_2^m,...,x_n^m] \\ \end{bmatrix} \end{align}
In our example, gender: $g\in[0,1]$ and weight: $wgt\in[20,180]$.
Scaling is the action of making the magnitude of all features similar.
A good normalization method would be:
\begin{equation*} x_n = \frac{x-x_m}{x_M-x_m} \end{equation*}
where $x_m$ and $x_M$ are the minimum and maximum values the feature takes.
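As a sketch, this min-max normalization can be applied per feature with NumPy (the values below are illustrative):

```python
import numpy as np

# Each row is a sample, each column a feature (e.g. age, weight).
X = np.array([[45.0, 90.0],
              [23.0, 60.0],
              [67.0, 120.0]])

# Min-max scaling: x_n = (x - x_m) / (x_M - x_m), applied column-wise.
x_min = X.min(axis=0)
x_max = X.max(axis=0)
X_scaled = (X - x_min) / (x_max - x_min)

print(X_scaled)  # every feature now lies in [0, 1]
```

After scaling, the gender-like and weight-like features have comparable magnitudes.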
In the case of supervised learning, the target or label is the actual result for a sample.
For Age=45, Weight=90, Gender=Female, Neighborhood=Villa Tachito, Soccer Team=Boca Jrs, the target would be 100 hairs on the head.
We denote the targets for all our samples with the vector $Y$:
\begin{align} \begin{bmatrix} y^0 \\ y^1 \\ \vdots \\ y^m \end{bmatrix} &= Y \end{align}
$H_\theta(X) = b + \theta_1 x_1 + ... + \theta_n x_n = y$
Where $X=[x_i]$ is a single sample: the features we use to predict a result (age, weight, gender, neighborhood, etc.).
$\theta_i$ are the weights we give to each feature. These are the values we change during learning iterations to improve the accuracy of our estimation.
$b$ is the bias, a value we use to fine-tune the hypothesis. We usually use $\theta_0$ with $x_0=1$ so that $b=\theta_0 x_0$.
$y$ is the value we predicted using the sample $X$ and the weights $\theta_i$.
So we get:
\begin{align} \begin{bmatrix} X^0 \\ X^1 \\ \vdots \\ X^m \end{bmatrix} \cdot \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{bmatrix} &= \begin{bmatrix} [x_0^0,x_1^0,...,x_n^0] \\ [x_0^1,x_1^1,...,x_n^1] \\ \vdots \\ [x_0^m,x_1^m,...,x_n^m] \\ \end{bmatrix} \cdot \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{bmatrix} = \begin{bmatrix} h_{\theta}(X^0) \\ h_{\theta}(X^1) \\ \vdots \\ h_{\theta}(X^m) \\ \end{bmatrix} \end{align}
So, on one hand the hypothesis predicts a value; on the other hand we have the real value for a sample (in the training set).
There is a difference between the hypothesis and the target value.
The cost function defines a metric that measures the distance between the target and the hypothesis.
The most trivial cost function would be the RMS (root mean square) error, but it is not always the best.
The process of learning, in many algorithms, consists of reducing the cost function. It is done by calculating the hypothesis for a value, comparing it with the target, updating the weights and trying again, until the cost function reaches a minimum.
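A minimal sketch of this loop, assuming a linear hypothesis and the mean squared error as the cost function (the data below is synthetic):

```python
import numpy as np

# Synthetic data: y = 3 + 2*x plus a little noise.
rng = np.random.default_rng(0)
m = 100
x = rng.uniform(0, 10, size=m)
y = 3 + 2 * x + rng.normal(0, 0.1, size=m)

# Add x_0 = 1 so the bias b = theta_0 * x_0 is part of theta.
X = np.column_stack([np.ones(m), x])          # shape (m, 2)
theta = np.zeros(2)

lr = 0.01
for _ in range(5000):
    h = X @ theta                             # hypothesis for every sample
    grad = (X.T @ (h - y)) / m                # gradient of the MSE cost
    theta -= lr * grad                        # update the weights

print(theta)  # should approach [3, 2]
```

Each iteration computes the hypothesis, compares it to the targets, and nudges the weights in the direction that reduces the cost.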
Note, the cost function is a function of the weights $\theta_i$ and not the samples.
For example:
Note:
We usually denote by $n$ the number of features and by $m$ the number of samples.
These are the parameters we use to configure the algorithm itself. Even though the hyperparameters are not a function of the features or their values, the best choice of hyperparameters does depend on the data.
Looking for the set of hyperparameters that turns out to be the best for our use of the algorithm. There are many ways to perform Grid Search.
One popular method used with Grid Search is k-fold cross-validation, in which we split the training data into k subsets and, in each round, use one subset as the testing set and the rest as the training set.
One round of cross-validation involves partitioning a sample of data into complementary subsets, performing the analysis on one subset (called the training set), and validating the analysis on the other subset (called the validation set or testing set). To reduce variability, in most methods multiple rounds of cross-validation are performed using different partitions, and the validation results are combined (e.g. averaged) over the rounds to estimate a final predictive model. Wikipedia
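A sketch of Grid Search with cross-validation using scikit-learn (the parameter grid here is just an illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Try several values of the hyperparameter max_depth;
# each candidate is scored with 5-fold cross-validation.
grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    param_grid={'max_depth': [1, 2, 3, 4, 5]},
                    cv=5)
grid.fit(X, y)

print(grid.best_params_)   # the winning hyperparameter setting
print(grid.best_score_)    # its averaged cross-validation accuracy
```

The grid is exhaustive, so the cost grows multiplicatively with each additional hyperparameter you search over.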
Underfitting occurs when the trained model is too general (too simple): it may predict partially well, but it misses many samples because it has not captured the underlying structure of the data (high bias).
Overfitting happens when our trained model is too specific to the training data, meaning it will only predict correctly on data very similar to the training set and will miss predictions on a more general data set.
There are problems that involve hundreds of features. Dimension reduction is the action of defining new, fewer features that are enough to describe the data and predict our target.
Dimension reduction is useful for data visualization, to ease the understanding of the data. It can also be useful to improve the prediction performance.
There are many methods; here I describe two: PCA and t-SNE.
Principal Component Analysis consists of looking for the directions (linear combinations of the features) with the highest variance and using them as the new, reduced features.
t-Distributed Stochastic Neighbor Embedding is an iterative method that reduces dimensions while preserving the local structure of nearby neighbors.
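A quick PCA sketch with scikit-learn, reducing the 4-dimensional iris data to 2 dimensions:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)       # 150 samples, 4 features

pca = PCA(n_components=2)               # keep the 2 directions of highest variance
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                  # (150, 2)
print(pca.explained_variance_ratio_)    # variance captured by each component
```

The explained variance ratio tells you how much information the reduced features retain, which helps decide how many components are enough.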
Interesting Note
Even though you can embed the data into any number of dimensions, it won't necessarily improve the embedding as one would expect.
Beautiful hands on demo.
This is the page where the method inventor centralizes all info about it.
This demo is a much easier example to understand: given lots of texts (or photos), t-SNE decides which words (or photos) are related to each other.
Creating ensembles involves aggregating the results of different models.
# Code source: Jaques Grobler
# License: BSD 3 clause
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
# Load the diabetes dataset
diabetes = datasets.load_diabetes()
# display(diabetes.DESCR)
# Use only one feature
diabetes_X = diabetes.data[:, np.newaxis, 2]
# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]
# Split the targets into training/testing sets
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]
# Create linear regression object
regr = linear_model.LinearRegression()
# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)
# Make predictions using the testing set
diabetes_y_pred = regr.predict(diabetes_X_test)
# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error
print("Mean squared error: %.2f"
% mean_squared_error(diabetes_y_test, diabetes_y_pred))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % r2_score(diabetes_y_test, diabetes_y_pred))
# Plot outputs
plt.scatter(diabetes_X_test, diabetes_y_test, color='black')
plt.plot(diabetes_X_test, diabetes_y_pred, color='blue', linewidth=3)
plt.xlabel('Patient Weight')
plt.ylabel('1 Year Disease Progression')
plt.xticks(())
plt.yticks(())
plt.show()
To think together
Let's say we are a real estate agent in Alaska.
We want to predict the price of an igloo given its radius.
We believe there is a linear relation between the area of the igloo and its price.
The thing is, the area of the igloo is proportional to $r^2$.
How do we find a linear regression to predict igloo price based on its radius?
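One possible answer, sketched with scikit-learn (the igloo prices below are made up): transform the radius into the derived feature $r^2$, so the relation becomes linear in the new feature.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data: price is roughly linear in r^2.
r = np.array([2.0, 2.5, 3.0, 3.5, 4.0])
price = np.array([41.0, 63.0, 91.0, 123.0, 161.0])

# Feature engineering: replace the raw radius with r^2,
# so price ~ theta * r^2 + b is an ordinary linear regression.
X = (r ** 2).reshape(-1, 1)
model = LinearRegression().fit(X, price)

new_r = 5.0
print(model.predict(np.array([[new_r ** 2]])))
```

The model is still linear in its parameters; we only changed what we call a feature.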
In a decision tree you lay out options for each feature and investigate the possible outcomes of choosing those options.
The idea of a decision tree is to divide the data set into smaller data sets based on the descriptive features until you reach a small enough set that contains data points that fall under one label/target. From this explanation
In case a feature has many possible values (or is continuous, such as time), you split the possible values into categories or boundaries and test the values against this quantization.
A method to reduce the tree depth by detecting branches and paths that do not contribute to the outcome.
Pruning is especially useful to reduce overfitting (but not only).
Random forest is an ensemble of several decision trees.
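A random forest sketch with scikit-learn (the number of trees is a hyperparameter):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# An ensemble of 100 decision trees, each trained on a bootstrap
# sample of the data; predictions are aggregated across the trees.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

print(forest.score(X_test, y_test))   # accuracy on held-out data
```

Averaging many decorrelated trees reduces the variance a single deep tree would have.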
k Nearest Neighbors
The idea is to predict the label of a sample by looking at the nearest samples in the training data.
The distance is easy to imagine on two or three dimensions. In more dimensions you need a metric to define distance between samples. For example: Hamming distance, Large Margin Nearest Neighbor or Neighbourhood components analysis.
$k$ is the number of closest samples used to test the value. If you have two classes, $k$ should be odd to avoid ties.
More generally, $k$ should not be a multiple of the number of classes.
Look at the impact of the value you chose for $k$:
What happens when k=3 and when k=5?
A drawback of the basic "majority voting" classification occurs when the class distribution is skewed.
Another drawback is that finding the closest samples requires calculating distances to many samples, which may be very time consuming for a large dataset $D$.
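A kNN sketch with scikit-learn, trying the two values of $k$ discussed above:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Predict each test sample from the labels of its k nearest neighbors.
for k in (3, 5):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    print(k, knn.score(X_test, y_test))
```

There is no real training phase: `fit` mostly stores the samples, and all the distance work happens at prediction time, which is exactly the drawback mentioned above.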
Support Vector Machine
SVM detects a road between the classes of the samples, then predicts by deciding on which side of the road a new sample falls.
It doesn't just detect the road, but the widest possible road:
Hyperplane
This is the center of the road that separates the classes. The hyperplane always has one dimension less than the number of features, the dimension of the data.
Kernel
What happens when the data cannot be separated by a linear hyperplane?
The kernel is the function that determines the shape of the hyperplane:
In this example the best choice for the kernel would be RBF, the Radial Basis Function.
There are a few standard kernels, yet you can define any function to be your kernel. If you go deeper into kernels you'll see it is not necessary to evaluate the feature mapping explicitly (the kernel trick), so you can choose a very nasty kernel if it suits your data's shape.
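A sketch with scikit-learn comparing a linear kernel with RBF on data that a linear hyperplane cannot separate:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric circles: no linear road exists between the classes,
# but the RBF kernel separates them in a higher-dimensional space.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel='linear').fit(X, y)
rbf_svm = SVC(kernel='rbf').fit(X, y)

print(linear_svm.score(X, y))   # poor: no linear separation exists
print(rbf_svm.score(X, y))      # near perfect
```

Swapping the kernel is the only change needed; the rest of the SVM machinery stays the same.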
Feature Space vs Input Space
These two become confusing when discussing SVMs. Even though they are similar, they are different. Same same, but different...
Feature space refers to the n-dimensions where your variables live (not including a target variable, if it is present). The term is used often in ML literature because a task in ML is feature extraction, hence we view all variables as features. For example, consider the data set with:
The feature space is $R^3$, or more accurately $R^{3+}$ (the positive octant), as all the $X$ variables can only be positive quantities. Domain knowledge about tires might suggest that the speed the vehicle was moving at is important, hence we generate another variable, $x_4$ (this is the feature extraction part):
$x_4≡\frac{x_1}{x_2}$ the speed of the vehicle during testing.
Mapping the input space into feature space.
The input is mapped into features through the function $\phi$, from $R^3$ to $R^4$:
$\phi(x_1,x_2,x_3)=(x_1,x_2,x_3,\frac{x_1}{x_2})$
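A sketch of this mapping in NumPy (the sample values below are purely illustrative):

```python
import numpy as np

def phi(x):
    """Map a sample from the input space R^3 to the feature space R^4
    by appending the derived feature x_4 = x_1 / x_2 (the speed)."""
    x1, x2, x3 = x
    return np.array([x1, x2, x3, x1 / x2])

sample = np.array([120.0, 2.0, 35.0])   # illustrative values only
print(phi(sample))                      # [120., 2., 35., 60.]
```

With the kernel trick, an SVM never needs to compute $\phi$ explicitly; it only needs inner products between mapped samples.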
Example
Using a kernel to map a non-linearly separable input into higher-dimensional, linearly separable features.
More
Here you can find an amazing lecture with the math behind SVMs.
ANNs
Computing systems inspired by the biological neural networks of the brain. The performance progressively improves on tasks (i.e. Learns) by considering examples, generally without task-specific programming.
For example, in image recognition, the examples could be photos labeled as 'CAT' or 'NOT CAT'. After learning from many examples, performance is measured by how often the network correctly recognizes whether there is a cat or not.
The learning process should be generic (regarding the content of the images): during the process, a set of relevant characteristics evolves from the learning material being processed.
In common ANN implementations, the signal at a connection between neurons is a real number, and the output of each neuron is calculated by a non-linear function of the sum of its inputs. Neurons and connections typically have a weight that adjusts as learning proceeds.
An ANN is based on a collection of connected units or nodes called neurons. Each connection between neurons transmits a value from one to another. The neuron that receives the signal can process it and then signal neurons connected to it.
This is the importance/relevance given to each input of a neuron. The weight values are changed on each learning iteration.
The outputs of one layer of neurons are connected to the inputs of the next layer.
What if we need the inputs of a neuron to comply with some limitations, for example to be only positive?
In such cases we add an activation layer between layers: a layer whose only job is to transform the outputs of one layer into suitable inputs for the next layer.
The most popular activation function today is the Rectified Linear Unit (ReLU). What it does is allow only positive inputs through:
$f(x) = x^+ = \max(0,x)$
$x$ is the input to the neuron.
Softmax is another popular activation function; it maps real values into the closed segment $[0,1]$ in such a way that the outputs sum to 1:
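Both activations can be sketched in a few lines of NumPy:

```python
import numpy as np

def relu(x):
    # Pass positive values through, clamp negative values to 0.
    return np.maximum(0, x)

def softmax(z):
    # Map a vector of real scores to values in [0, 1] that sum to 1.
    e = np.exp(z - z.max())        # shift for numerical stability
    return e / e.sum()

z = np.array([-1.0, 0.0, 2.0])
print(relu(z))      # [0. 0. 2.]
print(softmax(z))   # three values in [0, 1] summing to 1
```

ReLU is applied element-wise between layers, while softmax is typically used on the output layer to produce class probabilities.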
A deep network is a neural network composed of many layers.
Deep nets achieve outstanding results when trained. The catch is that they require huge data sets to reach their potential performance.
Back Propagation
To improve the performance of a network, errors flow in the opposite direction to the data in order to correct each of the network's weights.
Transfer learning or inductive transfer is a research problem in machine learning that focuses on storing knowledge gained while solving one problem and applying it to a different but related problem.
For example, knowledge gained while learning to recognize cars could apply when trying to recognize trucks.
Artificial neural networks where the connections between units do not form a cycle. Feedforward neural networks were the first type of artificial neural network invented and are simpler than their counterpart, recurrent neural networks. They are called feedforward because information only travels forward in the network (no loops), first through the input nodes, then through the hidden nodes (if present), and finally through the output nodes.
From this amazing website about mathematics.
The idea behind RNNs is to make use of sequential information. In a traditional neural network we assume that all inputs (and outputs) are independent of each other. But for many tasks that’s a very bad idea. If you want to predict the next word in a sentence you better know which words came before it. RNNs are called recurrent because they perform the same task for every element of a sequence, with the output being dependent on the previous computations. Another way to think about RNNs is that they have a “memory” which captures information about what has been calculated so far. In theory RNNs can make use of information in arbitrarily long sequences, but in practice they are limited to looking back only a few steps. Here is what a typical RNN looks like:
A recurrent neural network and the unfolding in time of the computation involved in its forward computation.
Taken from this great tutorial
Example
LSTM (Long Short-Term Memory, a gated variant of RNNs) used to play music: a duet between computer and human.
A neural network that is trained to attempt to copy its input to its output. Internally, it has a hidden layer $h$ that describes a code used to represent the input. The network may be viewed as consisting of two parts: an encoder function $h=f(x)$ and a decoder that produces a reconstruction $r=g(h)$. If an autoencoder succeeds in simply learning to set $g(f(x))=x$ everywhere, then it is not especially useful. Instead, autoencoders are designed to be unable to learn to copy perfectly. Usually they are restricted in ways that allow them to copy only approximately, and to copy only input that resembles the training data. Because the model is forced to prioritize which aspects of the input should be copied, it often learns useful properties of the data.
From this online book
A neural network that uses filters to detect features in the input.
CNNs are quite complicated. A beautiful explanation can be found here.
Let's say we have a one-bit image (black and white)
(this image is black where the value is 1 and white where it is 0)
Now we define a filter (AKA kernel or feature detector)
Now we scan our image with the filter and count, at each position, how many of the 1's in the filter are also 1's in the image:
The matrix formed by sliding the filter over the image and computing the dot product is called the Convolved Feature or Activation Map or the Feature Map. It is important to note that filters act as feature detectors from the original input image.
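A minimal sketch of this sliding-filter computation in NumPy ("valid" convolution, no padding); the image and filter below are illustrative:

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the filter over the image and take the dot product
    at every position, producing the feature map."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A one-bit image and a small filter, as in the text.
image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])

print(convolve2d(image, kernel))   # a 3x3 feature map
```

A 5x5 image and a 3x3 filter yield a 3x3 feature map: one value for every position where the filter fits entirely inside the image.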
To see how different filters have different impact on images see this site.
Different filters detect different features:
What you see is how two filters detect different features on the same image.
Now, once you have the features that were extracted from labeled images, you can look for them in an unlabeled image and make decisions accordingly.
Example
A fun game demonstrating CNN use.
Unsupervised machine learning is the machine learning task of inferring a function to describe hidden structure from "unlabeled" data.
Since the examples given to the learner are unlabeled, there is no evaluation of the accuracy of the structure that is output by the relevant algorithm—which is one way of distinguishing unsupervised learning from supervised learning and reinforcement learning.
RL is a learning algorithm based on taking actions in an environment so as to maximize some notion of cumulative reward.
The problem, due to its generality, is studied in many other disciplines.
This is a collective problem the players need to solve together.
In the image above, thanks to P1 they got an additional 1.1 points and thanks to P2 they got 3.7 points.
Parametrization
Real World Example
The ALOHA network problem
When the guys at Rephael used RL to solve this problem, they tried three approaches:
The Results
For the general reward, since everybody got the same points, the optimal procedure was that two couples transmit all the time and all the rest keep quiet. (Communism doesn't work in channel optimization.)
When agents were privately rewarded, all participants had to find how to divide the time slots so everybody can transmit. (Homo homini lupus - i.e. Adam le Adam Zeev)
When the reward was proportional to the messages all participants sent, the strategy found was to hold the channel until somebody else attempts to transmit (and a packet is dropped - for those who are still awake) and then release the channel to the other agent.
Optimal channel utilization was achieved when each agent was rewarded privately.
For further reading, here is a research paper regarding this example and a lecture based on this research by the same guy. I built this section based on the latter.